I. INTRODUCTION


The development of efficient public policies for traffic
management is extremely important for any large city.
Congestion and long queues have concerned public
managers in different parts of the world for decades.
Congestion is an inevitable phenomenon, but it can be
mitigated through good public policies, which can increase
people's quality of life and reduce the impact on the
environment. Accordingly, a considerable number of
studies have addressed this problem. In [1], two approaches
based on graph theory are proposed to solve the problem
of time information in public transportation systems,
whereas in [2], techniques for route planning on public
transportation networks are proposed. In [3], a study was
conducted to verify the efficiency of public transport in
smart cities, specifically in relation to vehicle fluidity.


In this work we propose a multiple linear regression
model to describe traffic volume in the Minneapolis-St Paul
metropolitan area as a function of several variables, which
include weather conditions and holiday occurrences.
Additionally, we present an analysis for a queue model
with a general arrival process, denoted by
(G/M/1) : (∞/FIFO). This model describes a queueing
system with a single service channel, in which interarrival
times and service times are independent and identically
distributed random variables. Interarrival times follow G,
a general probability distribution, and service times follow
M, an exponential distribution whose PDF (probability
density function) and CDF (cumulative distribution
function) are given by equations 1 and 2. Moreover, there
is no limit on the system capacity, and customers are
served on a first in, first out basis.


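$f(t) = \lambda e^{-\lambda t}, \quad t \in (0, \infty)$    (1)

$F(t) = 1 - e^{-\lambda t}, \quad t \in (0, \infty)$    (2)

where $\lambda > 0$ is the rate parameter of the exponential distribution.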
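
The exponential density and distribution function used for the service times can be evaluated numerically. The following is a minimal sketch in Python (chosen here for illustration; the analysis in this paper was carried out in R). The function names and the rate value are assumptions made for the example, not values taken from the study.

```python
import math


def exp_pdf(t: float, rate: float) -> float:
    """Density of an exponential distribution with the given rate, for t > 0."""
    return rate * math.exp(-rate * t) if t > 0 else 0.0


def exp_cdf(t: float, rate: float) -> float:
    """Probability that an exponential service time is at most t."""
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0


# Hypothetical service rate (customers served per unit of time).
rate = 2.0
median = math.log(2) / rate  # the time by which half of the services finish
print(exp_pdf(median, rate), exp_cdf(median, rate))
```

The median is used as the evaluation point only because the CDF there is one half by construction, which makes the sketch easy to check.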


In this sense, our research contributions can be 
summarized as follows: 


• We propose a multiple linear regression model to
describe traffic volume in the Minneapolis-St Paul
metropolitan area as a function of several variables,
which include weather conditions and holiday
occurrences;

• We present an analysis for a queue model with a
general arrival process, denoted by
(G/M/1) : (∞/FIFO). We specifically investigate
whether or not the arrivals follow a Poisson
distribution and which probability distribution is
most suitable for the observed data;

• In addition, the analysis of variance (ANOVA)
test is performed in order to verify the
contribution of each variable in the regression
model.


The rest of the paper is organized as follows: In section 
2, we present the related work. The problem statement is 
described in section 3. Experimental results are described 
in section 4 and conclusions are presented in section 5. 


II. RELATED WORK


Several related works address this topic. In [4], a
Bayesian inference model for traffic prediction is
proposed, capable of incorporating spatial and temporal
components. Furthermore, according to the authors, the
proposed solution works well with missing data points,
taking advantage of previous information. In [5], traffic
forecasting is formulated as a birth and death process to
describe the behavior of vehicles on the road. The authors
of [6] model the relationship between traffic congestion
and weather, using a multiple linear regression model to
predict daily changes in congestion, based on eight
weather forecast factors and six dummy variables to
express the days of the week. In [7], an overview is given
of approaches for reducing and managing congestion,
particularly its effects on public transport. In [8], a novel
integration of machine learning models into simulation is
presented to improve the realism of simulating a public
transport system. The authors conclude that, with an
efficient congestion prediction tool, it is possible to
effectively predict the time delays in traffic.


III. PROBLEM STATEMENT


Estimating traffic behavior in large cities is important
for the development of public policies that minimize
congestion, which affects people's quality of life and
impacts the environment. In this sense, this study
uses a multiple linear regression model, defined in
equation 3, to verify the relationship between the
dependent variable, traffic volume, and a set of
independent variables, such as the occurrence of
holidays and weather conditions. In addition, we
statistically evaluate which probability distribution (G)
is most suitable for the arrival process of the G/M/1 model.


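$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$    (3)


where $Y$ represents the dependent variable; $X_1, X_2, \ldots, X_p$
are the explanatory variables; $\beta_0$ is the intercept; $\beta_i$, for
$i = 1, 2, \ldots, p$, are the slope coefficients of the explanatory
variables; and $\epsilon$ is a random error with
$\epsilon \overset{iid}{\sim} N(0, \sigma^2)$.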
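
To make the regression model concrete, the following sketch estimates the coefficients of a multiple linear regression by solving the normal equations. It is written in Python with the standard library only and is not the R code used in this study. The helper names `solve` and `fit_ols` and the toy data are assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares estimates [b0, b1, ..., bp] from the normal equations X'X b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # the leading 1 yields the intercept b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical toy data with two explanatory variables, generated without noise
# from y = 1 + 2*x1 - 3*x2 so that the estimates can be checked by eye.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (3, 2)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
beta = fit_ols(rows, y)
print([round(b, 6) for b in beta])
```

Because the toy response contains no error term, the estimates should recover the generating coefficients up to floating-point rounding. With real data, a dedicated routine such as R's `lm` is preferable to hand-written elimination.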


IV. RESULTS 


In our experimental study, we considered the Metro
Interstate Traffic Volume Data Set, provided by the UCI
data repository [9]. This dataset contains hourly
Interstate 94 westbound traffic volume from 2012 to 2018
for MN DoT ATR station 301, roughly midway
between Minneapolis and St Paul, MN. Hourly weather
features and holidays are included to assess their impact
on traffic volume. In this sense, we investigate the
influence of variables such as holidays and weather on the
dependent variable, traffic volume. The attributes are:


• Holiday: categorical; US national holidays plus a
regional holiday, the Minnesota State Fair;
• Temperature in kelvin;
• Amount in mm of rain that occurred in the hour;
• Amount in mm of snow that occurred in the hour;
• Cloud cover: numeric percentage;
• Short textual description of the current weather;
• Longer textual description of the current weather;
• DateTime: hour of the data collected, in local CST
time;
• Traffic volume: numeric; hourly I-94 ATR 301 reported
westbound traffic volume.


Figure 1 shows the box plots of vehicle arrivals between
2012 and 2018, indicating that there is no significant
difference between the observations, whereas figure 2
shows the empirical cumulative distribution function
(CDF) of the data.


Fig. 1: Box plots of traffic volume per year.


Fig. 2: CDF of vehicle arrivals between 2012 and 2018.


Regression results:


As described in section 3, we fit a multiple linear
regression model to explore the relationship between
traffic volume for metropolitan Minneapolis and the
independent variables defined at the beginning of this
section. Listing 1 shows the output of the model,
implemented in the R language. Note that the R-squared,
which measures the proportion of the variance of the
dependent variable explained by the independent
variables, was 0.933. That is, 93.3% of the variance in the
data can be explained by the independent variables.
Call:
lm(formula = traffic_volume ~ holiday + rain + temperature +
    clouds + clear + mist + rain + snow + drizzle + haze + thunderstorm)

Residuals:
    Min      1Q  Median      3Q     Max
 -10486   -4526   -81.6   362.5   13503

Coefficients:
              Estimate
(Intercept)    915.286
holiday        -33.384
rain          4790.665
temperature
clouds
clear
mist
snow
drizzle
haze
thunderstorm

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 514.4 on 48193 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.933,  Adjusted R-squared:  0.933
F-statistic: 6.709e+04 on 10 and 48193 DF,  p-value: < 2.2e-16


Listing 1: Results of multiple regression analysis.


To provide evidence for our approach, we now use the
ANOVA test to verify the null and alternative
hypotheses defined in equation 6. Listing 2 shows the
results of this test. The ANOVA test is based on the
sum-of-squares decomposition. In other words, the
deviation of an observation from the mean can be
decomposed as the deviation of the observation from the
regression-fitted value plus the deviation of the fitted
value from the mean, that is, we can write [10]:


$(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y} + Y_i - \hat{Y}_i) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$    (4)


Squaring both sides of equation 4 and summing over all
observations (the cross term vanishes for a least-squares
fit), we obtain:


$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, that is, $SST = SSR + SSE$    (5)


where SST is the total sum of squares; SSR is the sum of
squares due to regression; and SSE is the sum of squares
of the errors (residuals).
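
The sum-of-squares decomposition can be checked numerically. The sketch below, in Python with hypothetical data, fits a simple (one-variable) regression for brevity and verifies that SST equals SSR plus SSE. It also derives the R-squared and the F-statistic from the same quantities. The single-predictor simplification and all names are assumptions for illustration, not the model fitted in this paper.

```python
def simple_fit(x, y):
    """Closed-form least-squares intercept and slope for one explanatory variable."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1


def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least-squares line through (x, y)."""
    b0, b1 = simple_fit(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    my = sum(y) / len(y)
    sst = sum((yi - my) ** 2 for yi in y)
    ssr = sum((fi - my) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, ssr, sse


# Hypothetical observations, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst                    # share of the variance explained
p = 1                                    # number of explanatory variables
f_stat = (ssr / p) / (sse / (len(x) - p - 1))
print(sst, ssr + sse, r_squared, f_stat)
```

The identity holds only for a least-squares fit that includes an intercept. For other fits the cross term does not vanish.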


$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_1: \beta_j \neq 0$ for at least one $j$    (6)
Response: traffic_volume
              Sum Sq      Mean Sq   F value
holiday
rain          1.7540e+11
temperature
clouds
clear
mist
snow
drizzle
haze
thunderstorm
Residuals

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Listing 2: Variance analysis (ANOVA) results.


Since the p-values associated with the F-statistic for the
variables rain, temperature, clouds, mist, snow, drizzle,
haze and thunderstorm are all less than 0.05, we have
enough evidence to reject the null hypothesis and conclude
that a significant amount of the variation in the response
is explained by the proposed model. For the variables
holiday and clear, a low sum-of-squares value and a high
p-value are observed, which means that little of the
variation can be explained by those variables.


Queue results: 


Here our objective is to identify which probability
distribution is best suited to model the arrival process for
the G/M/1 queue. Typically, the arrival process is
modeled as a Poisson distribution. We therefore
performed a chi-square goodness-of-fit test to verify
whether the Poisson distribution is suitable for this
scenario. The following hypotheses were considered, with
a significance level of α = 0.05:


$H_0$: the data follow a Poisson distribution;
$H_1$: the data do not follow a Poisson distribution.    (7)


To measure the degree of disagreement between observed
and expected frequencies, we use the summation defined in
equation 8.


$\chi^2 = \sum_{i=1}^{k} \frac{(f_{oi} - f_{ei})^2}{f_{ei}}$    (8)

where $f_{oi}$ represents the observed frequencies; $f_{ei}$ the
expected frequencies; and $k$ the number of classes or
intervals. Listing 3 shows the result of the test in
question.
# Chi-squared test in R
# Output


Chi-squared test for given probabilities


data: traffic_volume


X-squared = 58373468, df = 48204, p-value < 2.2e-16


Listing 3: Results of Chi-Square test for traffic volume and
Poisson distribution.
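
The chi-square statistic of equation 8 can be sketched for a Poisson hypothesis as follows. This is a Python illustration under stated assumptions, not the R call that produced Listing 3. The Poisson mean is estimated from the sample, the last class collects all counts at or above the sample maximum, and the toy counts are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of observing exactly k events (small k only)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def chi_square_poisson(counts):
    """Chi-square statistic and degrees of freedom for a Poisson fit.

    The mean is estimated from the data. Classes are 0, 1, ..., top-1 plus a
    final class ">= top", so that the expected probabilities sum to one.
    """
    n = len(counts)
    lam = sum(counts) / n
    top = max(counts)
    observed = Counter(counts)
    probs = [poisson_pmf(k, lam) for k in range(top)]
    probs.append(1.0 - sum(probs))                      # tail class
    obs = [observed.get(k, 0) for k in range(top)] + [observed.get(top, 0)]
    stat = sum((o - n * p) ** 2 / (n * p) for o, p in zip(obs, probs))
    dof = len(probs) - 1 - 1    # classes - 1 - one estimated parameter
    return stat, dof


# Hypothetical small counts; real hourly volumes are far larger and would
# need binning and a numerically stable probability computation.
counts = [2, 3, 1, 4, 2, 0, 3, 2, 5, 1, 2, 3]
stat, dof = chi_square_poisson(counts)
print(stat, dof)
```

The statistic is then compared with the critical value of the chi-square distribution with `dof` degrees of freedom at the chosen significance level. In practice, classes with very small expected frequencies are usually merged before computing the sum, which this sketch does not do.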


Since the p-value is less than 0.05, we have enough
statistical evidence to reject the null hypothesis at a
significance level of 0.05 and thus accept the alternative
hypothesis that the arrival process does not follow a
Poisson distribution. This can also be observed in
figure 3, which compares the theoretical (Poisson)
distribution with the empirical data.


Fig. 3: Comparison between the Poisson distribution and
the empirical data (panels: "Emp. and theo. distr." and
"Emp. and theo. CDFs"; x-axis: Data; legend: empirical,
theoretical).


Figure 4 presents the Cullen and Frey graph [11], used to
identify a candidate distribution among a set of parametric
distributions on the basis of their skewness and kurtosis.
For an i.i.d. sample $(X_i)_i$ with observations $(x_i)_i$,
the skewness and kurtosis and their corresponding
unbiased estimators are given by equations 9 to 12 [12] [13].
Fig. 4: Cullen and Frey graph of arrival (y-axis: kurtosis;
x-axis: square of skewness; legend: observation and the
theoretical distributions normal, uniform, exponential,
logistic, beta, lognormal and gamma; Weibull is close to
gamma and lognormal).


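$sk(X) = \frac{E[(X - E(X))^3]}{Var(X)^{3/2}}$    (9)

$\widehat{sk} = \frac{\sqrt{n(n-1)}}{n-2} \times \frac{m_3}{m_2^{3/2}}$    (10)

$kr(X) = \frac{E[(X - E(X))^4]}{Var(X)^2}$    (11)

$\widehat{kr} = \frac{n-1}{(n-2)(n-3)} \left( (n+1) \frac{m_4}{m_2^2} - 3(n-1) \right) + 3$    (12)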


where $m_2$, $m_3$, $m_4$ denote the empirical moments defined by
$m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k$, for $k = 1, 2, 3, \ldots$, with $x_i$
representing the $n$ observations of variable $x$ and $\bar{x}$ their
mean value. Skewness is the degree of asymmetry of a
distribution. A normal distribution has a skewness value of
zero [14]. Kurtosis is the degree of peakedness of a
distribution, usually taken relative to a normal distribution
[15]. A distribution with a relatively high peak is called
leptokurtic and has a kurtosis greater than 3, while a
flat-topped distribution is called platykurtic and has a
kurtosis less than 3. The normal distribution, which is not
very peaked, is called mesokurtic and has a kurtosis equal
to 3. In the Cullen and Frey graph, the uniform distribution
is the closest to the real data (observation). Additionally,
a comparison between empirical and theoretical data is
presented in figure 5.
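
The estimators of equations 10 and 12 can be computed directly from the empirical moments. The following Python sketch uses hypothetical data on an evenly spaced grid, which mimics a uniform sample. For a uniform distribution the theoretical skewness is 0 and the kurtosis is 1.8, which places it in the platykurtic range. The function names are assumptions for illustration.

```python
import math


def central_moment(xs, k):
    """Empirical central moment m_k = (1/n) * sum((x_i - mean)^k)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n


def skewness_unbiased(xs):
    """Skewness estimator of equation 10."""
    n = len(xs)
    m2, m3 = central_moment(xs, 2), central_moment(xs, 3)
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5


def kurtosis_unbiased(xs):
    """Kurtosis estimator of equation 12."""
    n = len(xs)
    m2, m4 = central_moment(xs, 2), central_moment(xs, 4)
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * m4 / m2 ** 2 - 3 * (n - 1)) + 3


# Hypothetical data: 1000 evenly spaced points on [0, 1].
xs = [i / 999 for i in range(1000)]
sk = skewness_unbiased(xs)
kr = kurtosis_unbiased(xs)
print(sk, kr)
```

The pair (squared skewness, kurtosis) is the point that a Cullen and Frey graph plots for the observations, next to the fixed points or regions of the candidate distributions.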
Fig. 5: Comparison between the uniform distribution and
the empirical data (panels: "Empirical and theoretical
dens.", "Q-Q plot" and "P-P plot"; axes: Data and
Theoretical probabilities).


The suitability of the uniform distribution is evident both
in the Q-Q plot (quantile-quantile plot) and in the P-P
plot (probability-probability plot, percent-percent plot
or P value plot) [16], in which the empirical data form
approximately a straight line when compared to the
theoretical data.
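
The points behind a P-P plot and a Q-Q plot can be computed without any plotting library. The sketch below, in Python with hypothetical data, fits a uniform distribution using the sample minimum and maximum and measures how far the P-P points lie from the diagonal. The plotting positions (i + 0.5)/n and all names are assumptions for illustration.

```python
def uniform_cdf(x, lo, hi):
    """CDF of a uniform distribution on [lo, hi]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def pp_points(data):
    """Pairs (theoretical probability, empirical probability) for a P-P plot."""
    xs = sorted(data)
    n = len(xs)
    lo, hi = xs[0], xs[-1]           # uniform fit from the sample range
    return [(uniform_cdf(x, lo, hi), (i + 0.5) / n) for i, x in enumerate(xs)]


def qq_points(data):
    """Pairs (theoretical quantile, empirical quantile) for a Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    lo, hi = xs[0], xs[-1]
    return [(lo + (i + 0.5) / n * (hi - lo), x) for i, x in enumerate(xs)]


# Hypothetical observations.
data = [3, 7, 1, 9, 5, 2, 8, 4, 6, 10]
max_dev = max(abs(t - e) for t, e in pp_points(data))
print(max_dev)
```

A small maximum deviation means the P-P points stay close to the straight line through the origin with slope one, which is the visual criterion applied to figure 5.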


V. CONCLUSION 


We proposed a multiple linear regression model to
describe the relationship between traffic volume in the
Minneapolis-St Paul metropolitan area and a set of
variables that includes weather conditions and holiday
occurrences. The ANOVA test provides enough evidence
to conclude that a significant amount of the variation in
the response is explained by the proposed model,
specifically by the variables rain, temperature, clouds,
mist, snow, drizzle, haze and thunderstorm. Additionally,
we presented an analysis for a queue model with a general
arrival process, denoted by (G/M/1) : (∞/FIFO). Our
results indicate that the uniform distribution is more
suitable to model the process of vehicle arrivals.


Future research includes:


• Applying other regression models, in order to verify the
adequacy of these models to the problem in question;


• Developing a Bayesian approach to evaluate arrivals
and queue service times.